Representing numeric data in 32 bits while preserving 64-bit precision
Abstract
Data files often consist of numbers having only a few significant decimal digits, whose information content would allow storage in only 32 bits. However, we may require that arithmetic operations involving these numbers be done with 64-bit floating-point precision, which precludes simply representing the data as 32-bit floating-point values. Decimal floating point gives a compact and exact representation, but requires conversion with a slow division operation before the data can be used in an arithmetic operation. Here, I show that interesting subsets of 64-bit floating-point values can be compactly and exactly represented by the 32 bits consisting of the sign, exponent, and high-order part of the mantissa, with the lower-order 32 bits of the mantissa filled in by a table lookup indexed by bits from the part of the mantissa that is retained, and possibly some bits from the exponent. For example, decimal data with four or fewer digits to the left of the decimal point and two or fewer digits to the right of the decimal point can be represented in this way, using a decoding table with 32 entries, indexed by the lower-order 5 bits of the retained part of the mantissa. Data consisting of six decimal digits with the decimal point in any of the seven positions before or after one of the digits can also be represented this way, and decoded using a table indexed by 19 bits from the mantissa and exponent. Encoding with such a scheme is a simple copy of half the 64-bit value, followed if necessary by verification that the value can be represented, by checking that it decodes correctly. Decoding requires only extraction of index bits and a table lookup. Lookup in a small table will usually reference fast cache memory, and even with larger tables, decoding is still faster than conversion from decimal floating point with a division operation. 
I present several variations on these schemes, show how they perform on various recent computer systems, and discuss how such schemes might be used to automatically compress large arrays in interpretive languages such as R.
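The four-digits-before, two-digits-after scheme described above can be checked empirically. The sketch below (in Python, not the paper's implementation) builds the 32-entry decoding table by enumerating every value n/100 with |n| ≤ 999999, verifies that the low 5 bits of the retained half do determine the discarded low 32 mantissa bits uniquely, and implements encode (a copy of the high half) and decode (index extraction plus table lookup). It assumes IEEE-754 doubles and Python's correctly rounded decimal-to-double conversion; the function names are illustrative, not the paper's.

```python
import struct

def bits64(x: float) -> int:
    """Raw 64-bit IEEE-754 representation of a double."""
    return struct.unpack('>Q', struct.pack('>d', x))[0]

def from_bits(b: int) -> float:
    """Inverse of bits64."""
    return struct.unpack('>d', struct.pack('>Q', b))[0]

# Build the decoding table for decimals with at most four digits before
# and two after the decimal point (n / 100).  Index: the low 5 bits of
# the retained high half; entry: the discarded low 32 mantissa bits.
# The assertion checks the claim that 5 index bits suffice for this set.
table = {}
for n in range(1000000):        # the sign bit does not affect the mantissa
    b = bits64(n / 100)
    idx = (b >> 32) & 0x1F
    lo = b & 0xFFFFFFFF
    assert table.setdefault(idx, lo) == lo, "collision: 5 index bits too few"

def encode(x: float) -> int:
    """Keep only the high 32 bits: sign, exponent, top 20 mantissa bits."""
    return bits64(x) >> 32

def decode(h: int) -> float:
    """Rebuild the double from its high half plus a table lookup."""
    return from_bits((h << 32) | table[h & 0x1F])

# Round-trip spot checks, including a negative value.
for x in (0.0, 0.01, -9999.99, 1234.56, 8192.25):
    assert decode(encode(x)) == x
```

As the abstract notes, encoding a value whose representability is not known in advance would add one step: decode the candidate 32-bit result and check that it reproduces the original 64-bit value exactly.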
Journal: CoRR
Volume: abs/1504.02914
Published: 2015